18 research outputs found

Incorporating translation quality-oriented features into log-linear models of machine translation

Author: Penkale Sergio
Publication venue: Dublin City University. School of Computing
Publication date: 01/11/2011

The current state-of-the-art approach to Machine Translation (MT) has limitations which could be alleviated by the use of syntax-based models. Although the benefits of syntax use in MT are becoming clear with the ongoing improvements in string-to-tree and tree-to-string systems, tree-to-tree systems such as Data Oriented Translation (DOT) have, until recently, suffered from a lack of training resources, and as a consequence are currently immature, lacking key features compared to Phrase-Based Statistical MT (PB-SMT) systems. In this thesis we propose avenues to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems. Noting that both types of systems score translations using probabilities not necessarily related to the quality of the translations they produce, we introduce a training mechanism which takes translation quality into account by averaging the edit distance between a translation unit and translation units used in oracle translations. This training mechanism could in principle be adapted to a very broad class of MT systems. In particular, we show how, when translating Spanish sentences into English, it leads to improvements in the translation quality of both PB-SMT and DOT. In addition, we show how our method leads to a PB-SMT system which uses significantly fewer resources and translates significantly faster than the original, while maintaining the improvements in translation quality. We then address the issue of the limited feature set in DOT by defining a new DOT model which is able to exploit features of the complete source sentence. We introduce a feature into this new model which conditions each target word on the source context it is associated with, and we also make the first attempt at incorporating a language model (LM) into a DOT system. We investigate different estimation methods for our lexical feature (namely Maximum Entropy and improved Kneser-Ney), reporting on their empirical performance.
After describing methods which enable us to improve the efficiency of our system, and which allow us to scale to larger training data sizes, we evaluate the performance of our new model on English-to-Spanish translation, obtaining significant translation quality improvements compared to the original DOT system.

Irish Universities

DCU Online Research Access Service

MATREX: the DCU MT system for WMT 2009

Author: Du Jinhua; He Yifan; Penkale Sergio; Way Andy
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2009

In this paper, we describe the machine translation system in the evaluation campaign of the Fourth Workshop on Statistical Machine Translation at EACL 2009. We describe the modular design of our multi-engine MT system with particular focus on the components used in this participation. We participated in the translation task for the following translation directions: French–English and English–French, in which we employed our multi-engine architecture to translate. We also participated in the system combination task, which was carried out by the MBR decoder and the confusion network decoder. We report results on the provided development and test sets.

CiteSeerX

Irish Universities

DCU Online Research Access Service

Accuracy-based scoring for phrase-based statistical machine translation

Author: Galron Daniel; Ma Yanjun; Penkale Sergio; Way Andy
Publication venue: Association for Machine Translation in the Americas
Publication date: 01/01/2010

Although the scoring features of state-of-the-art Phrase-Based Statistical Machine Translation (PB-SMT) models are weighted so as to optimise an objective function measuring translation quality, the estimation of the features themselves does not have any relation to such quality metrics. In this paper, we introduce a translation quality-based feature to PB-SMT in a bid to improve the translation quality of the system. Our feature is estimated by averaging the edit distance between phrase pairs involved in the translation of oracle sentences, chosen by automatic evaluation metrics from the N-best outputs of a baseline system, and phrase pairs occurring in the N-best list. Using our method, we report a statistically significant 2.11% relative improvement in BLEU score for the WMT 2009 Spanish-to-English translation task. We also report that using our method we can achieve statistically significant improvements over the baseline using many other MT evaluation metrics, and a substantial increase in speed and reduction in memory use (due to a reduction in phrase-table size of 87%) while maintaining significant gains in translation quality.

CiteSeerX

DCU Online Research Access Service

Evaluating syntax-driven approaches to phrase extraction for MT

Author: Groves Declan; Penkale Sergio; Srivastava Ankit Kumar; Tinsley John
Publication date: 01/01/2009

In this paper, we examine a number of different phrase segmentation approaches for Machine Translation and how they perform when used to supplement the translation model of a phrase-based SMT system. This work represents a summary of a number of years of research carried out at Dublin City University in which it has been found that improvements can be made using hybrid translation models. However, the level of improvement achieved is dependent on the amount of training data used. We describe the various approaches to phrase segmentation and combination explored, and outline a series of experiments investigating the relative merits of each method.

DCU Online Research Access Service

Accuracy-based scoring for DOT: towards direct error minimization for data-oriented translation

Author: Galron Daniel; Melamed I. Dan; Penkale Sergio; Way Andy
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2009

In this work we present a novel technique to rescore fragments in the Data-Oriented Translation model based on their contribution to translation accuracy. We describe three new rescoring methods, and present the initial results of a pilot experiment on a small subset of the Europarl corpus. This work is a proof-of-concept, and is the first step in directly optimizing translation decisions solely on the hypothesized accuracy of potential translations resulting from those decisions.

CiteSeerX

Irish Universities

DCU Online Research Access Service

OpenMaTrEx: a free/open-source marker-driven example-based machine translation system

Author: Dandapat Sandipan; Forcada Mikel; Groves Declan; Penkale Sergio; Tinsley John; Way Andy
Publication venue: Springer Science and Business Media LLC
Publication date: 01/08/2010

We describe OpenMaTrEx, a free/open-source example-based machine translation (EBMT) system based on the marker hypothesis, comprising a marker-driven chunker, a collection of chunk aligners, and two engines: one based on a simple proof-of-concept monotone EBMT recombinator and the other a Moses-based statistical decoder. OpenMaTrEx is a free/open-source release of the basic components of MaTrEx, the Dublin City University machine translation system.

DCU Online Research Access Service

MATREX: the DCU MT system for WMT 2010

Author: Banerjee Pratyush; Dandapat Sandipan; Du Jinhua; Forcada Mikel; Haque Rejwanul; Kumar Naskar Sudip; Pecina Pavel; Penkale Sergio; Srivastava Ankit Kumar; Way Andy
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 15/07/2010

This paper describes the DCU machine translation system in the evaluation campaign of the Joint Fifth Workshop on Statistical Machine Translation and Metrics in ACL-2010. We describe the modular design of our multi-engine machine translation (MT) system with particular focus on the components used in this participation. We participated in the English–Spanish and English–Czech translation tasks, in which we employed our multi-engine architecture to translate. We also participated in the system combination task, which was carried out by the MBR decoder and the confusion network decoder.

Irish Universities

DCU Online Research Access Service

Algoritmos genéticos para análisis sintáctico

Author: Penkale Sergio
Publication date: 01/01/2008

Thesis (Licentiate in Computer Science), Universidad Nacional de Córdoba, 2008. Traditional parsing techniques define probabilistic models that allow the search space to be traversed exhaustively in reasonable time. Instead of explicitly defining a model and searching for the optimal parse among those the model permits, in this work we implement a mechanism that lets us extend this search space to a much larger and practically unrestricted one.

Repositorio Digital de la Universidad Nacional de Córdoba
